Introduction
Hello, and welcome to our first tutorial! In this tutorial, we will analyze data from Chicago Food Inspections for the years 2010-2019. We will visualize the data and try to build predictions from it.

Here we load the data locally to be able to manipulate it and use it later.

chicago <- read_csv("/Users/ilanamakover/320/food-inspections.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Inspection ID` = col_double(),
##   `License #` = col_double(),
##   Zip = col_double(),
##   `Inspection Date` = col_datetime(format = ""),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Historical Wards 2003-2015` = col_double(),
##   `Zip Codes` = col_double(),
##   `Community Areas` = col_double(),
##   `Census Tracts` = col_double(),
##   Wards = col_double()
## )
## See spec(...) for full column specifications.

Let’s take a look at a small portion of our data.

chicago[1:5, 1:22] %>%
  mutate(Violations = substr(Violations, 1, 50))%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
(Table: the first five rows of the raw data. The columns are Inspection ID, DBA Name, AKA Name, License #, Facility Type, Risk, Address, City, State, Zip, Inspection Date, Inspection Type, Results, Violations (truncated to 50 characters), Latitude, Longitude, Location, and several ward, zip-code, community-area, and census-tract identifiers. The sample rows are restaurants such as BLACK EAGLE CLUB and ROBERT’S PIZZA AND DOUGH COMPANY plus a mobile food preparer, THE FAT SHALLOT, all inspected on 2019-05-07 with results of Fail or Pass w/ Conditions.)


Okay, great. We have the names of the establishments and other relevant information about each inspection, including but not limited to the address, risk level, inspection result, and violations.

Data Cleaning

We are only interested in establishments that either pass, pass with conditions, or fail; therefore, we are going to remove any record that doesn’t fit one of these categories.

Data from 2019 is removed because 2019 only has 5 months’ worth of data, whereas each year between 2010 and 2018 has twelve months’ worth.

New columns are created to help with later computations. For example, a year column is added where we extract the year from the Inspection Date column.

A risk value column is added to discretize risk. The Risk column has 3 categories: “Risk 1 (High)”, “Risk 2 (Medium)”, and “Risk 3 (Low)”. We create a new column that converts the 3 categories into 3 numeric values: 3 represents Risk 1 (High), 2 represents Risk 2 (Medium), and 1 represents Risk 3 (Low).

A results value column is added to discretize results. We create a new column that converts the 3 inspection-result categories into 3 numeric values: if the establishment fails inspection, the score is 1; if it passes with conditions, the score is 2; and if it passes, the score is 3.

We also create indicator columns. Pass marks whether an inspection passed at all (passing with conditions counts as a pass), while countBakeryPass and countGSPass mark passing inspections of bakeries and grocery stores specifically; these make it easy to compute pass rates by facility type later.

chicago <- chicago %>%
  #filter out things we do not need or want
  filter(Results != "Not Ready" & Results != "Business Not Located" & Results != "No Entry" & Results 
         != "Out of Business" & Results != "NA" & Risk != "All" & `Facility Type` != "NA")%>%
  
  #add indicator columns: overall pass, plus pass indicators for bakeries and grocery stores
  mutate(Pass = ifelse(Results == "Pass" | Results =="Pass w/ Conditions", 1, 0))%>%
  mutate(countBakeryPass = ifelse(`Facility Type` == "Bakery", 
                                  ifelse(Results != "Fail", 1, 0), 0))%>%
  mutate(countGSPass = ifelse(`Facility Type` == "Grocery Store", 
                                  ifelse(Results != "Fail", 1, 0), 0))%>%
  

  #grab the year of the inspection date
  mutate(year = as.numeric(gsub("(^[0-9]{4})(-[0-9]{2}-)([0-9]{2})", "\\1", `Inspection Date`))) %>%
  
  filter(year != 2019)%>%
  #adding risk value
  mutate(`Risk Value` = ifelse(Risk == "Risk 1 (High)", 3, 
                               ifelse(Risk == "Risk 2 (Medium)", 2, 1))) %>%
  
  #adding results value
  mutate(`Results Value` = ifelse(Results == "Fail", 1, 
                                  ifelse(Results == "Pass w/ Conditions", 2, 3)))

Here is a small snippet of what our new dataset looks like.

chicago[1:5, 1:28] %>%
  mutate(Violations = substr(Violations, 1, 50))%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
(Table: the first five rows of the cleaned data. Alongside the original columns, each row now carries Pass, countBakeryPass, countGSPass, year, Risk Value, and Results Value. For example, a 2018-12-31 Canvass inspection of WALGREENS #2025, a Risk 3 (Low) grocery store that passed, has Pass = 1, countGSPass = 1, year = 2018, Risk Value = 1, and Results Value = 3.)
Exploratory Data Analysis: Data Visualization

In this section, we are visualizing the data.
Below, we create an interactive map of Chicago. You will be able to scroll through Chicago and see which businesses pass or fail inspection, which businesses have a high or low risk level, where these businesses are located, and the name of each business.

To get a better understanding of the data, we sample 2000 entries.

Here is a quick guide on how to interpret the map below:

Icon: Indicates Result
Pass- thumbs-up
Fail- thumbs-down
Pass with Conditions- check-circle

MarkerColor: Indicates Risk Value
Highest = 3 = red
Medium = 2 = white
Low = 1 = blue

The business name pops up when the icon is clicked.
The cluster colors bear no significance to the data.

#sampling 2000 entries
dat <- chicago %>%
  filter(!is.na(`Longitude`)) %>%
  sample_n(2000)


getResult <- function(dat) {
  ifelse(dat$Results == "Pass", 'thumbs-up', 
         ifelse(dat$Results == "Fail", 'thumbs-down', 'check-circle'))
}

getRiskValue <- function(dat) {
  ifelse(dat$`Risk Value` == 3, 'red',
         ifelse(dat$`Risk Value` == 2, 'white', 'blue'))
}

#icon functions
icons <- awesomeIcons(
  icon = getResult(dat),
  #iconColor = getResultColor(dat),
  markerColor = getRiskValue(dat),
  library = 'fa'
)

leaflet(dat) %>%
  addTiles() %>%
  addAwesomeMarkers(~Longitude,
                    ~Latitude,
                    icon = icons,
                    clusterOptions = markerClusterOptions(),
                    popup = ~`DBA Name`) %>%
  addFullscreenControl()


The map is helpful for visualizing small clusters of businesses in Chicago. However, it’s hard to get an idea of the results, since we have to zoom in to view the distribution of only a small portion of businesses at a time. Just by looking at the map, it’s hard to grasp how many businesses actually pass, pass with conditions, and fail. Also, since this map only samples 2000 entries across 9 years, it is even more difficult to understand the distribution of all passes, fails, and passes with conditions.


Instead, let’s look at a graph representation of results between 2010 - 2018.

chicago %>%
  ggplot(aes(x=Results, fill = Results))+
  geom_bar()


That looks better! We can see that overall there are more passes than there are fails or passes with conditions. However, this lumps together all the results from 2010 - 2018.

To get a better understanding of how businesses perform over the years, let’s break the results down by year by turning year into a factor and faceting on it. This will make a figure with 9 small plots (similar to the one above) inside it, one plot for each year!


chicago %>%
  
  #one discrete period per year (2010-2018)
  mutate(discrete_period = factor(year)) %>%
  
  ggplot(aes(x=Results, fill = Results)) +
  geom_bar() +
  
  facet_wrap(~discrete_period) +
  theme(legend.position = "top")+
  
  #make the labels
  ggtitle("Results vs Year")


We can now get a better idea of the number of passing businesses per year. As we can see from the plots above, every year more businesses pass than fail. We can also see that passing with conditions increases across the years.

Now, let’s look at how risks, high, low and medium, are broken down by year.

chicago %>%
  
  #one discrete period per year (2010-2018)
  mutate(discrete_period = factor(year)) %>%
  
  ggplot(aes(x=Risk, fill = Risk)) +
  geom_bar() +
  
  facet_wrap(~discrete_period) +
  theme(legend.position = "top")+
  ggtitle("Risk vs Year")

There seems to be no trend between risk and year.

Another way to visualize passes and passes with conditions per year is to create a scatter plot where each point represents one year, plotting the mean pass rate against the total number of passes that year.

chicago %>%
  group_by(year)%>%
  summarize(passMean = mean(Pass), total = sum(Pass))%>%
  ggplot(aes(x=total, y = passMean))+
  geom_point()+theme_minimal()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


From the graph above, it appears that the mean pass rate increases only minimally as the yearly total of passes grows.
Let’s look at some other information and see the distribution of the passing inspections, both passing and passing with conditions, of Bakeries in Chicago over time.

bakeries <-chicago %>%
  filter(`Facility Type` == "Bakery")%>%
  group_by(year)%>%
  summarize(passBakeryNum = mean(countBakeryPass))%>%
  ggplot(aes(x=year, y = passBakeryNum))+
  geom_point()+theme_minimal()+geom_smooth()

bakeries
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


It seems that the proportion of passing bakeries barely changes from year to year; on average, it is stagnant.

Since there does not appear to be a relationship between passing bakeries and year, we will try plotting the distribution of grocery stores passing inspections (which includes passing with conditions), over time.

chicago %>%
  filter(`Facility Type` == "Grocery Store")%>%
  group_by(year)%>%
  summarize(passGSNum = mean(countGSPass))%>%
  ggplot(aes(x=year, y = passGSNum))+
  geom_point()+theme_minimal()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There seems to be a relationship between grocery stores passing and time; however, it does not appear to be linear. We will test this by fitting a linear regression model to check whether the relationship is linear. If it turns out to be non-linear, we will try a logistic regression model to see if we get better results!

Experiment Design and Hypothesis

Based on the Grocery Stores Passing vs Year graph, we believe there is no linear correlation between grocery stores passing and year. A standard way to test such a belief is null hypothesis significance testing: we state a null hypothesis and check whether the data gives us evidence to reject it.

Our null hypothesis is that there is no linear relationship between grocery stores passing and year; in other words, the slope of the regression line is zero.
To test this, we build a linear model of grocery stores passing against year. Linear models are a staple of data analysis: they allow us to construct confidence intervals and perform hypothesis tests on the relationship between variables.

In order to test only grocery stores, we create a new dataset from the Chicago dataset grouped by year, with a new attribute, mean_pass, which is the mean of Pass per year. From this new dataset, we fit a linear regression model of mean_pass on year.


stores_mean <- chicago %>%
  filter(`Facility Type` == "Grocery Store") %>%
  group_by(year) %>%
  summarize(mean_pass = mean(Pass)) %>%
  select(year, mean_pass)

#fit a linear model of mean pass rate on year
mean_lm <- lm(mean_pass ~ year, data=stores_mean)

summary(mean_lm)
## 
## Call:
## lm(formula = mean_pass ~ year, data = stores_mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05683 -0.01837  0.00114  0.03365  0.04454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.292775   9.515287  -1.502    0.177
## year          0.007460   0.004725   1.579    0.158
## 
## Residual standard error: 0.0366 on 7 degrees of freedom
## Multiple R-squared:  0.2626, Adjusted R-squared:  0.1573 
## F-statistic: 2.493 on 1 and 7 DF,  p-value: 0.1583
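As a sanity check, the p-value in the last row of the summary can be reproduced from the t statistic and the residual degrees of freedom; a quick sketch using the printed values:

```r
t_value <- 1.579  # t statistic for the year coefficient, from the summary
df      <- 7      # residual degrees of freedom

# two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(-abs(t_value), df)
round(p_value, 3)  # ~0.158, matching Pr(>|t|) in the summary
```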


Multiple R-squared is used to evaluate how well the model fits the data. It identifies how much of the variance in the predicted (dependent) variable can be explained by the independent variable. Our R-squared value of 0.2626 means the model explains only about 26% of the variation in the outcome. Moreover, the p-value for the year coefficient (0.158) is well above the usual 0.05 threshold, so we fail to reject the null hypothesis that the slope is zero. In other words, the data gives no evidence of a linear relationship between grocery stores passing and year.
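Multiple R-squared comes from the definition R-squared = 1 - RSS/TSS (residual sum of squares over total sum of squares). A minimal sketch on toy data (the values below are hypothetical stand-ins, not the real dataset):

```r
# hypothetical yearly pass rates standing in for stores_mean
toy <- data.frame(
  year      = 2010:2018,
  mean_pass = c(0.68, 0.67, 0.71, 0.76, 0.77, 0.78, 0.72, 0.74, 0.79)
)
fit <- lm(mean_pass ~ year, data = toy)

rss <- sum(residuals(fit)^2)                         # residual sum of squares
tss <- sum((toy$mean_pass - mean(toy$mean_pass))^2)  # total sum of squares
r_squared <- 1 - rss / tss  # identical to summary(fit)$r.squared
```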

Let’s check whether the residuals are skewed; that is, let’s check whether they are spread symmetrically around their median.

#compute the quartiles of the model residuals
quantiles <- quantile(mean_lm$residuals)

quantiles
##           0%          25%          50%          75%         100% 
## -0.056827652 -0.018365843  0.001139828  0.033652606  0.044544230

To determine whether there is skew, we check whether the first quartile (the 25% column) and the third quartile (the 75% column) are equidistant from the median. To do that, simply subtract the median from the third quartile and check whether that value equals the median minus the first quartile.

In short, we have to check:

75% column - median = median - 25% column

0.033652606 - 0.001139828 =? 0.001139828 - (-0.018365843)

0.032512778 =? 0.019505671

Since the two sides are not equal, the residuals are skewed.
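The same symmetry check can be done in code; a quick sketch using the quartile values printed by quantile() above:

```r
# quartiles of the residuals, copied from the quantile() output above
q25 <- -0.018365843
q50 <-  0.001139828
q75 <-  0.033652606

upper_spread <- q75 - q50  # distance from the median up to the third quartile
lower_spread <- q50 - q25  # distance from the first quartile up to the median

# the spreads differ (~0.0325 vs ~0.0195), so the residuals are skewed
upper_spread - lower_spread
```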

Let’s construct a confidence interval so we can say how precise we think our estimate of the slope is. We will use the standard error of B1 to construct a 95% confidence interval.


#gather the stats from the linear_model using tidy() which is defined by the broom package.
stats <- mean_lm%>%
  tidy() %>%
  select(term, estimate, std.error)
stats
## # A tibble: 2 x 3
##   term         estimate std.error
##   <chr>           <dbl>     <dbl>
## 1 (Intercept) -14.3       9.52   
## 2 year          0.00746   0.00472
confidence_interval_offset <- 1.96 * stats$std.error[2]
confidence_interval <- round(c(stats$estimate[2] - confidence_interval_offset,
                               stats$estimate[2],
                               stats$estimate[2] + confidence_interval_offset), 4)

confidence_interval
## [1] -0.0018  0.0075  0.0167

The interval runs from about -0.0018 to 0.0167, centered on the slope estimate of 0.0075 with an offset of roughly 0.0093 on either side. Since the 95% confidence interval contains zero, we cannot rule out a slope of zero, which again points to no meaningful linear trend between grocery stores passing and year.
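A 95% interval uses the 97.5th percentile of the standard normal distribution, qnorm(0.975) ≈ 1.96, so the interval is estimate ± 1.96 × SE. A quick sketch recomputing it from the printed slope estimate and standard error:

```r
est <- 0.007460  # slope estimate for year, from the model summary
se  <- 0.004725  # its standard error

z  <- qnorm(0.975)  # ~1.96
ci <- round(c(est - z * se, est + z * se), 4)
ci  # c(-0.0018, 0.0167): the interval contains 0
```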

mean_augment <- mean_lm %>%
  augment()

mean_augment %>% head()
## # A tibble: 6 x 9
##   mean_pass  year .fitted .se.fit   .resid  .hat .sigma .cooksd .std.resid
##       <dbl> <dbl>   <dbl>   <dbl>    <dbl> <dbl>  <dbl>   <dbl>      <dbl>
## 1     0.684  2010   0.703  0.0225 -0.0184  0.378 0.0384 0.123       -0.636
## 2     0.673  2011   0.710  0.0187 -0.0374  0.261 0.0353 0.250       -1.19 
## 3     0.710  2012   0.717  0.0154 -0.00768 0.178 0.0394 0.00579     -0.231
## 4     0.759  2013   0.725  0.0131  0.0345  0.128 0.0365 0.0747       1.01 
## 5     0.766  2014   0.732  0.0122  0.0337  0.111 0.0367 0.0595       0.975
## 6     0.784  2015   0.740  0.0131  0.0445  0.128 0.0344 0.124        1.30

In the residuals-vs-fitted plot below, we are looking for residuals scattered randomly around zero; systematic curvature in the smoothed line would indicate that the linear model is missing structure in the data.

mean_augment %>%
  ggplot(aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As we can see above, there is no linear relationship between grocery stores passing and year. So we will try a logistic model using a few more facility types.


We create a new dataset called facilities, which only keeps information on bakeries, grocery stores, and restaurants.

facilities <- chicago %>%
  select(year, `Facility Type`, Results, Pass)%>%
  filter(`Facility Type` == "Bakery" | `Facility Type` == "Grocery Store" | `Facility Type` == "Restaurant") 

Great! Now that we have those attributes, we want to know whether facility type and year are correlated with passing results.

#model passing as a function of year and facility type
logistic_model = glm(formula = Pass ~ year+`Facility Type`, data = facilities, family="binomial")

summary(logistic_model)
## 
## Call:
## glm(formula = Pass ~ year + `Facility Type`, family = "binomial", 
##     data = facilities)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7990   0.6648   0.6817   0.7047   0.8142  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -36.51254    5.41767  -6.740 1.59e-11 ***
## year                           0.01870    0.00269   6.951 3.64e-12 ***
## `Facility Type`Grocery Store  -0.13362    0.05093  -2.624 0.008700 ** 
## `Facility Type`Restaurant      0.17998    0.04904   3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136310  on 129311  degrees of freedom
## Residual deviance: 135921  on 129308  degrees of freedom
## AIC: 135929
## 
## Number of Fisher Scoring iterations: 4
#now let the year trend vary by facility type via an interaction term
interactive_logistic_model = glm(formula = Pass ~ `Facility Type`*year, data = facilities, family="binomial")

summary(interactive_logistic_model)
## 
## Call:
## glm(formula = Pass ~ `Facility Type` * year, family = "binomial", 
##     data = facilities)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7880   0.6722   0.6835   0.6987   0.8513  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                        2.318e+00  3.930e+01   0.059   0.9530  
## `Facility Type`Grocery Store      -9.903e+01  4.134e+01  -2.396   0.0166 *
## `Facility Type`Restaurant         -2.605e+01  3.977e+01  -0.655   0.5124  
## year                              -5.836e-04  1.951e-02  -0.030   0.9761  
## `Facility Type`Grocery Store:year  4.911e-02  2.052e-02   2.393   0.0167 *
## `Facility Type`Restaurant:year     1.303e-02  1.975e-02   0.660   0.5095  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136310  on 129311  degrees of freedom
## Residual deviance: 135893  on 129306  degrees of freedom
## AIC: 135905
## 
## Number of Fisher Scoring iterations: 4
interactive_logistic_model_stats <- interactive_logistic_model %>% tidy()
interactive_logistic_model_stats
## # A tibble: 6 x 5
##   term                                estimate std.error statistic p.value
##   <chr>                                  <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                         2.32       39.3       0.0590  0.953 
## 2 `Facility Type`Grocery Store      -99.0        41.3      -2.40    0.0166
## 3 `Facility Type`Restaurant         -26.1        39.8      -0.655   0.512 
## 4 year                               -0.000584    0.0195   -0.0299  0.976 
## 5 `Facility Type`Grocery Store:year   0.0491      0.0205    2.39    0.0167
## 6 `Facility Type`Restaurant:year      0.0130      0.0197    0.660   0.509
year<-interactive_logistic_model_stats$estimate[4]

estimate<- interactive_logistic_model_stats%>%
  mutate(`Facility Type` = term, estimate = estimate)%>%
  select(`Facility Type`, estimate)%>%
  slice(5:6)
  
estimate_df <- data.frame(estimate)


estimate_df <- estimate_df %>%
  mutate(increase = year+ estimate)

estimate_df
##                       Facility.Type  estimate   increase
## 1 `Facility Type`Grocery Store:year 0.0491134 0.04852979
## 2    `Facility Type`Restaurant:year 0.0130257 0.01244209

With bakeries as the baseline (the intercept), the log-odds of passing increase by about 0.0485 per year for grocery stores and by about 0.0124 per year for restaurants, compared to essentially no yearly trend for bakeries.
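Because logistic regression coefficients live on the log-odds scale, exponentiating the per-year slopes computed above converts them into odds ratios, which are easier to read:

```r
# per-year log-odds slopes from the interaction model above
slope_grocery    <- 0.04852979
slope_restaurant <- 0.01244209

exp(slope_grocery)     # ~1.050: odds of passing grow ~5% per year
exp(slope_restaurant)  # ~1.013: odds of passing grow ~1.3% per year
```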


augmented_logistic <- logistic_model %>%
  augment()

augmented_logistic %>% head()
## # A tibble: 6 x 10
##    Pass  year Facility.Type .fitted .se.fit .resid    .hat .sigma .cooksd
##   <dbl> <dbl> <chr>           <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1     0  2018 Restaurant       1.40  0.0130 -1.80  2.69e-5   1.03 2.72e-5
## 2     1  2018 Restaurant       1.40  0.0130  0.665 2.69e-5   1.03 1.66e-6
## 3     1  2018 Restaurant       1.40  0.0130  0.665 2.69e-5   1.03 1.66e-6
## 4     1  2018 Restaurant       1.40  0.0130  0.665 2.69e-5   1.03 1.66e-6
## 5     1  2018 Grocery Store    1.08  0.0196  0.763 7.29e-5   1.03 6.17e-6
## 6     1  2018 Restaurant       1.40  0.0130  0.665 2.69e-5   1.03 1.66e-6
## # … with 1 more variable: .std.resid <dbl>
augmented_logistic %>%
  ggplot(aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'